feat: Add multi-column support for null-aware anti joins #19857
Open
viirya wants to merge 3 commits into apache:main
This commit extends null-aware anti join functionality to support
multiple columns, enabling queries like:
SELECT * FROM t1 WHERE (a, b) NOT IN (SELECT x, y FROM t2);
and correlated multi-column NOT IN subqueries:
SELECT * FROM t1 WHERE (c2, c3) NOT IN (
SELECT c2, c3 FROM t2 WHERE t1.c1 = t2.c1
);
Changes:
Physical Execution Layer:
- Remove single-column validation restriction in HashJoinExec
- Extend NULL detection in probe phase to check ANY column for NULLs
- Extend NULL filtering in final phase to filter rows with ANY NULL column
- Add comprehensive unit tests for 2-column and 3-column joins
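The NULL handling listed above can be sketched in plain Rust. This is a minimal illustration that follows the bullets literally, not the actual `HashJoinExec` code: the `Key` alias and the functions `has_null` and `null_aware_anti_join` are hypothetical names, keys are modeled as vectors of `Option<i64>` (with `None` standing in for SQL NULL), and the empty-subquery branch is an assumption based on the "empty subquery" test case mentioned under Test Coverage.

```rust
/// A join key: one optional value per column; `None` models SQL NULL.
/// (Hypothetical type, used only for this sketch.)
type Key = Vec<Option<i64>>;

/// True if ANY column of the key is NULL.
fn has_null(key: &[Option<i64>]) -> bool {
    key.iter().any(|v| v.is_none())
}

/// Hypothetical stand-in for a multi-column null-aware anti join.
/// `left` holds the outer rows, `right` holds the subquery rows.
fn null_aware_anti_join(left: &[Key], right: &[Key]) -> Vec<Key> {
    // Assumption: with an empty subquery, every outer row passes NOT IN,
    // including rows that contain NULLs.
    if right.is_empty() {
        return left.to_vec();
    }
    // Probe phase: a NULL in any column of any subquery row
    // means no outer row is emitted.
    if right.iter().any(|r| has_null(r)) {
        return Vec::new();
    }
    // Final phase: keep outer rows that have no NULL column
    // and no exact match among the subquery rows.
    left.iter()
        .filter(|&l| !has_null(l) && !right.contains(l))
        .cloned()
        .collect()
}
```

For example, with outer rows `(1, 2)`, `(3, NULL)`, `(5, 6)` and a subquery returning only `(1, 2)`, this sketch keeps `(5, 6)`: the first row matches and the second has a NULL column.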
SQL Planning Layer:
- Allow tuple expressions in parse_in_subquery()
- Add validation for tuple field count matching
Query Optimization Layer:
- Update InSubquery validation to allow struct expressions
- Skip type coercion for struct expressions (handled in decorrelation)
- Implement struct decomposition in decorrelate_predicate_subquery
- Decompose struct(a, b) into individual join conditions a = x AND b = y
- Handle both correlated and non-correlated multi-column subqueries
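As a rough illustration of the struct decomposition step, the sketch below pairs tuple fields with subquery output columns by position and also performs the field-count check mentioned under SQL Planning. It works on column names as strings purely for readability; the real optimizer operates on expression trees, and `decompose_tuple` is a hypothetical name, not a DataFusion function.

```rust
/// Pair each outer tuple field with the subquery column in the same position
/// and render the resulting join condition as text.
/// Hypothetical helper: illustrates the idea only, not the optimizer's API.
fn decompose_tuple(outer: &[&str], subquery_cols: &[&str]) -> Result<String, String> {
    // The tuple and the subquery must expose the same number of fields.
    if outer.len() != subquery_cols.len() {
        return Err(format!(
            "tuple has {} fields but subquery returns {} columns",
            outer.len(),
            subquery_cols.len()
        ));
    }
    // struct(a, b) vs. (x, y) becomes: a = x AND b = y
    let conditions: Vec<String> = outer
        .iter()
        .zip(subquery_cols)
        .map(|(l, r)| format!("{l} = {r}"))
        .collect();
    Ok(conditions.join(" AND "))
}
```

Calling `decompose_tuple(&["a", "b"], &["x", "y"])` yields `a = x AND b = y`, while a two-field tuple against a one-column subquery returns an error.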
Test Coverage:
- Add 7 new SQL logic test cases (Tests 19-25)
- Add 3 unit test functions with 15 test variants (5 batch sizes each)
- Cover 2-column, 3-column, empty subquery, and NULL patterns
- Include correlated multi-column NOT IN from issue apache#10583
Test Results:
- 31/31 null-aware anti join tests passing
- 369/369 total hash join tests passing
- All optimizer tests passing
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Add test coverage for multi-column IN subqueries to verify that the struct expression support works correctly for both negated (NOT IN) and non-negated (IN) cases.
Tests added to subquery.slt:
- Test 1: Basic two-column IN
- Test 2: Multi-column IN with no matches
- Test 3: Multi-column IN with NULL values (verifies non-null-aware behavior)
- Test 4: Three-column IN
- Test 5: Correlated multi-column IN
- Test 6: Verify logical plan shows LeftSemi with multiple join conditions
- Test 7: Multi-column IN with empty subquery
- Test 8: Multi-column IN with WHERE clause in subquery
These tests complement the multi-column NOT IN tests in null_aware_anti_join.slt and verify that struct decomposition (converting `(a, b) IN (SELECT x, y ...)` into `a = x AND b = y`) works correctly for LeftSemi joins.
Key differences from NOT IN:
- IN uses LeftSemi join (not null-aware)
- IN does not use CollectLeft partition mode
- NULL values don't match in regular semi joins (two-valued logic)
Related to multi-column null-aware anti join implementation.
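To make the contrast with NOT IN concrete, here is a minimal sketch of the non-null-aware semi join behavior described above, where a NULL in any key column simply fails to match. The names `keys_match` and `left_semi_join` are hypothetical, and keys are again modeled as `Option<i64>` vectors rather than Arrow arrays.

```rust
/// Two keys match only if every column pair is non-NULL and equal.
/// A NULL on either side makes the pair a non-match.
fn keys_match(l: &[Option<i64>], r: &[Option<i64>]) -> bool {
    l.len() == r.len()
        && l.iter()
            .zip(r)
            .all(|(a, b)| matches!((a, b), (Some(x), Some(y)) if x == y))
}

/// Hypothetical stand-in for a LeftSemi join on multi-column keys:
/// keep each outer row that has at least one matching subquery row.
fn left_semi_join(
    left: &[Vec<Option<i64>>],
    right: &[Vec<Option<i64>>],
) -> Vec<Vec<Option<i64>>> {
    left.iter()
        .filter(|l| right.iter().any(|r| keys_match(l, r)))
        .cloned()
        .collect()
}
```

Under this sketch, an outer row `(3, NULL)` is dropped even when the subquery also returns `(3, NULL)`, and an empty subquery produces no rows at all.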
- Collapse nested if statement in invariants.rs (clippy::collapsible_if)
- Collapse nested if statement in hash_join/exec.rs (clippy::collapsible_if)
- Use unwrap_or_else instead of unwrap_or for function calls in decorrelate_predicate_subquery.rs (clippy::or_fun_call)
Contributor
Thanks @viirya, I'll check it soon. AFAIK there are no TPC*-like queries that cover multi-column null-aware anti joins, so it would probably be nice to have a benchmark in the future to make sure no performance regression is introduced by future PRs.